\rm
\magnification=1200
\def\today{\ifcase\month\or
January\or February\or March\or April\or May\or June\or
July\or August\or September\or October\or November\or December\fi
\space\number\day, \number\year}
\line{\sevenrm 162B01.tex[v2,rwf] \today\hfill}
\vskip .225in
\noindent
CS162
\vskip .225in
\noindent
An entropy problem
\vskip .225in
\noindent
We have a large file on external memory to sort. It contains $n$ records,
and has original entropy $\lg(n!)$, since there are $n!$ different
permutations which might be required. External memory is organized in
{\sl pages\/} of $p$ records each, which can be brought in or out only as
units; fast memory, where all comparisons and computations are done, has
room for only a few pages. We therefore decide on this outline of an
algorithm. We maintain two page-sized output buffers and one input buffer
in fast memory. We read in data and then, by whatever means, put each
record into one of the output buffers. When an output buffer is full, we
return its page to external memory. The last time we refer to a page, we
sort it.
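\smallskip
\noindent
To fix ideas, here is a minimal sketch of one such distribution pass,
written in Python. The routing rule {\tt choose} and the page writer
{\tt emit} are hypothetical parameters standing for ``whatever means''
we adopt; they are not prescribed by the outline above.
\medskip
{\parindent=0pt \tt \catcode`\#=12 \obeylines\obeyspaces
def distribute(pages, p, choose, emit):
    # two page-sized output buffers in fast memory
    bufs = [[], []]
    for page in pages:            # one input page at a time
        for r in page:
            i = choose(r)         # the one-bit routing decision
            bufs[i].append(r)
            if len(bufs[i]) == p:
                emit(i, bufs[i])  # full buffer: return its page
                bufs[i] = []
    for i in (0, 1):              # flush the partly filled buffers
        if bufs[i]:
            emit(i, bufs[i])
}
\medskip
\noindent
Each call of {\tt choose} is one of the buffer decisions counted in the
argument that follows.
\smallskip
\noindent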
Each decision of which buffer to put a record in gains at most one bit of
information. The final within-page sorts gain information at most
$(n/p)\lg(p!)$, or about $n\lg p$, since $\lg(p!)\approx p\lg p$. The
total information gained must be at least equal, on the average, to the
entropy $\lg(n!)$, or about $n\lg n$, of the problem, so the number of
times a record is placed in a buffer must be at least $n\lg n-n\lg
p=n\lg(n/p)$; if we organize the algorithm in passes, each of which sees
each record once, the number of passes must be at least
$n\lg(n/p)/n=\lg(n/p)$. Thus any algorithm we can program to work in
about $\lg(n/p)$ passes is close to the maximum achievable efficiency.
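\smallskip
\noindent
To summarize the count in one display (with $b$ denoting the total number
of buffer decisions, a name introduced here only for convenience):
$$b+{n\over p}\lg(p!)\ge\lg(n!),\qquad{\rm so}\qquad
b\ge\lg(n!)-{n\over p}\lg(p!)\approx n\lg n-n\lg p=n\lg(n/p).$$
As a worked example with illustrative numbers, for $n=2^{20}$ records and
pages of $p=2^{10}$ records, the bound is $\lg(n/p)=\lg 2^{10}=10$
passes: each record must be routed through a buffer about ten times,
however clever the routing rule.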
\vfill \end